
Word Occurrences

Once the single-nucleotide frequencies are known, it is possible to calculate the expected frequencies of $n$-grams assembled by random juxtaposition. Constraints on the assembly are revealed by deviations of the actual frequencies from the expected values. This is the principle behind the determination of dinucleotide bias. It is, however, limited with regard to the inferences that may be drawn. For one thing, as $n$ increases, the statistics become very poor. The genome of E. coli, for example, is barely large enough to contain a single example of every possible 11-gram, even if each one were deliberately included. Furthermore, the comparison of actual frequencies with expected ones depends on the model used to calculate the expected frequencies. All higher-order correlations are subsumed into a single number, from which little can be said about the relative importance of a particular sequence.
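The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the text: the function names (`ngram_frequencies`, `expected_frequency`, `bias_ratios`) and the toy sequence are assumptions introduced here, and the expected frequency is taken to be the plain product of single-nucleotide frequencies, which is one possible model of random juxtaposition.

```python
from collections import Counter


def ngram_frequencies(seq, n):
    """Relative frequencies of the overlapping n-grams in seq."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def expected_frequency(gram, single):
    """Expected frequency under random juxtaposition (assumed model):
    the product of the single-nucleotide frequencies."""
    p = 1.0
    for base in gram:
        p *= single[base]
    return p


def bias_ratios(seq, n):
    """Observed/expected ratio for every n-gram that occurs in seq."""
    single = ngram_frequencies(seq, 1)
    observed = ngram_frequencies(seq, n)
    return {g: f / expected_frequency(g, single) for g, f in observed.items()}


# Toy sequence, purely illustrative (not real genomic data).
toy = "ACACGTGTACAC"
for gram, ratio in sorted(bias_ratios(toy, 2).items()):
    print(gram, round(ratio, 3))

# Number of distinct 11-grams over a four-letter alphabet.
print(4 ** 11)  # 4194304
```

The last line makes the sparsity argument concrete: there are $4^{11} = 4\,194\,304$ possible 11-grams, which is of the same order as the length of the E. coli genome (roughly 4.6 million base pairs for the K-12 strain, a figure not given in the text), so on average each 11-gram can be expected to occur only about once.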

It is possible to approach this problem more objectively (according to a maximum entropy principle$^{23}$) by asking what is the most probable continuation of a given $n$-gram (cf. Eq. 6.21). Frequency dictionaries may be reconstructed from thinner ones according to this principle; for example, if one wishes to reconstruct the dictionary $W_n$ from $W_{n-1}$, the reconstructed frequencies are$^{24}$

$$\tilde{f}_{i_1,\ldots,i_n} = \frac{f_{i_1,\ldots,i_{n-1}}\, f_{i_2,\ldots,i_n}}{f_{i_2,\ldots,i_{n-1}}}, \tag{17.4}$$

where $i_1, \ldots$ are the successive nucleotides in the $n$-gram. The reconstructed dictionary is denoted by $\tilde{W}_n(n-1)$. The most unexpected, and hence informative, $n$-grams are then those with the biggest differences between the real and reconstructed frequencies (i.e., with values of the ratio $f/\tilde{f}$ significantly different from unity).
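As a rough sketch of how Eq. 17.4 might be applied, the Python below reconstructs the $n$-gram frequencies from the $(n-1)$-gram dictionary and reports the ratio $f/\tilde{f}$. The function names and the toy sequence are assumptions made for illustration. The denominator of Eq. 17.4 is an $(n-2)$-gram frequency; here it is simply counted from the sequence rather than derived from $W_{n-1}$, and for $n = 2$ it is taken to be 1 (the frequency of the empty word), which is also an assumption of this sketch.

```python
from collections import Counter


def frequency_dictionary(seq, n):
    """Relative frequencies of the overlapping n-grams in seq."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def reconstruct(seq, n):
    """Reconstructed n-gram frequencies following Eq. 17.4 (requires n >= 2)."""
    f_n1 = frequency_dictionary(seq, n - 1)
    # (n-2)-gram frequencies; for n == 2 the overlap is the empty word.
    f_n2 = frequency_dictionary(seq, n - 2) if n > 2 else {"": 1.0}
    recon = {}
    for left in f_n1:          # plays the role of i_1 ... i_{n-1}
        for right in f_n1:     # plays the role of i_2 ... i_n
            if left[1:] == right[:-1]:  # the two words must overlap
                gram = left + right[-1]
                recon[gram] = f_n1[left] * f_n1[right] / f_n2[left[1:]]
    return recon


def surprise_ratios(seq, n):
    """Ratio f / f~ for every n-gram that actually occurs in seq."""
    actual = frequency_dictionary(seq, n)
    recon = reconstruct(seq, n)
    return {g: actual[g] / recon[g] for g in actual}


# Toy sequence, purely illustrative (not real genomic data).
toy = "ACACGTGTACACGGT"
ratios = surprise_ratios(toy, 3)
# List the 3-grams whose ratio deviates most from unity first.
for gram in sorted(ratios, key=lambda g: abs(ratios[g] - 1.0), reverse=True):
    print(gram, round(ratios[gram], 3))
```

On a real genome the ratios for rare words are dominated by sampling noise, so in practice some threshold on the counts would probably be needed before a deviation from unity is read as informative.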

17.7 Phylogenies

The notion that life-forms evolved from a single common ancestor (i.e., that the history of life is a tree) is pervasive in biology.$^{25}$ Before gene and protein sequences became available, trees were constructed from the externally observable characteristics of organisms. Each organism is therefore represented by a point in phenotype space. In the simplest (binary) realization, a characteristic is either absent (0) or present (1) or is present in either a primitive (0) or an evolved (1) form. The distance

23 The entropy of a frequency dictionary is defined as

$$S_n = -\sum_{j=1} f_j \log f_j . \tag{17.3}$$

24 Gorban et al. (2000).

25 The concept of phylogeny was introduced by E. Haeckel; see Sect. 14.9.